Lesson 2. Introduction to the sf package

In this lesson we’ll learn about sf, the package that is core to using geospatial data in R. We’ll go through the structure of the data (it’s not too different from regular data.frames!), geometries, shapefiles, and how to save your hard work.


Instructor Notes

2.1 What is sf?

sf = simple features

sf creates geospatial data.frames that retain all of the functionality of R data.frames, but which are extended with a geometry column and with geospatial metadata, making it easy to process your data using both standard table-based operations and explicitly geospatial operations.

Load sf

Let’s start by loading the sf library.

library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.2

2.2 Read in a shapefile

As we discussed in the initial geospatial overview, a shapefile is one type of geospatial data that holds vector data.

To learn more about ESRI Shapefiles, this is a good place to start: ESRI Shapefile Wiki Page

The tricky thing to remember about shapefiles is that they’re actually a collection of 3 to 9+ files together. Here’s a list of all the files that can make up a shapefile:

shp: The main file that stores the feature geometry

shx: The index file that stores the index of the feature geometry

dbf: The dBASE table that stores the attribute information of features

prj: The file that stores the coordinate system information. (should be required!)

xml: Metadata —Stores information about the shapefile.

cpg: Specifies the code page for identifying the character set to be used.

But it remains the most commonly used file format for vector spatial data, and it’s really easy to visualize in one go!

Let’s try it out with California counties, and use sf for the first time. sf::st_read is a flexible function that let’s you read in many different types of geospatial data.

# Read in the counties shapefile
counties = st_read('notebook_data/california_counties/CaliforniaCounties.shp')
## Reading layer `CaliforniaCounties' from data source `/home/drew/Desktop/stuff/berk/dlab/RGeo_rewrite/notebook_data/california_counties/CaliforniaCounties.shp' using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 58 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -374445.4 ymin: -604500.7 xmax: 540038.5 ymax: 450022
## projected CRS:  NAD83 / California Albers
# Plot out California counties
plot(counties)
## Warning: plotting the first 9 out of 58 attributes; use max.plot = 58 to plot all

Wow! That gives us a plot grid of up to the first 9 attributes (i.e. columns) in our dataset.

And what if we just want to plot a single variable?

plot(counties['MED_AGE_M'])

Wow! So easy! We just made a choropleth map of median male age, by county!

We’re off to a running start.

2.3 Explore the sf data.frame

Before we get in too deep, let’s discuss what a sf data.frame is, and how it’s different from a plain data.frame.

The sf data.frame

An sf data.frame, a.k.a. an sf object, is just like a plain data.frame, but with an extra geometry column. The sf package then has a variety of geospatial functions that use that geometry column.

I repeat because it’s important:

An sf object is a plain data.frame with a geometry column added.

This means all the normal operations that we can run on data.frames will also work on sf objects!

With that in mind, let’s start exploring our sf object just like we would a dataset in a plain data.frame.

# Find the number of rows and columnds in counties
dim(counties)
## [1] 58 59
# Look at the first couple of rows in our sf object
head(counties)
## Simple feature collection with 6 features and 58 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -267387.9 ymin: -578158.6 xmax: 216677.6 ymax: 352693.6
## projected CRS:  NAD83 / California Albers
##   FID_        NAME STATE_NAME POP2010 POP10_SQMI POP2012  POP12_SQMI   WHITE  BLACK AMERI_ES
## 1    0        Kern California  839631      102.9  851089  104.282870  499766  48921    12676
## 2    0       Kings California  152982      109.9  155039  111.427421   83027  11014     2562
## 3    0        Lake California   64665       48.6   65253   49.082334   52033   1232     2049
## 4    0      Lassen California   34895        7.4   35039    7.422856   25532   2834     1234
## 5    0 Los Angeles California 9818605     2402.3 9904341 2423.264150 4936599 856874    72828
## 6    0      Madera California  150865       70.1  153025   71.065672   94456   5629     4136
##     ASIAN HAWN_PI HISPANIC   OTHER MULT_RACE   MALES FEMALES AGE_UNDER5 AGE_5_9 AGE_10_14
## 1   34846    1252   413033  204314     37856  433108  406523      72885   68694     68473
## 2    5620     271    77866   42996      7492   86344   66638      12877   11564     11324
## 3     724     108    11088    5455      3064   32469   32196       3633    3574      3888
## 4     356     165     6117    3562      1212   22416   12479       1625    1595      1853
## 5 1346865   26094  4687889 2140632    438713 4839654 4978951     645793  633690    678845
## 6    2802     162    80992   37380      6300   72682   78183      11983   11756     11755
##   AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_65_74 AGE_75_84 AGE_85_UP
## 1     72493     65339    122046    108500    108479     77285     43502     23473      8462
## 2     11356     13158     25589     21878     20282     12924      6844      3809      1377
## 3      4190      3362      6603      7095     10255     10625      6553      3502      1385
## 4      2107      2831      6337      5513      5447      4113      1984      1041       449
## 5    753630    752788   1475731   1430326   1368947   1013156    568470    345603    151626
## 6     12224     11032     20562     19167     19291     15833      9868      5468      1926
##   MED_AGE MED_AGE_M MED_AGE_F HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C
## 1    30.7      30.2      31.4     254610      3.15      23580      25629     74254      58472
## 2    31.1      31.9      30.0      41233      3.19       3313       3884     12993       9299
## 3    45.0      43.8      45.9      26548      2.39       3913       3970      4061       7381
## 4    37.0      35.5      40.9      10058      2.50       1348       1231      2068       3094
## 5    34.8      33.6      35.9    3241204      2.98     360530     424398    790374     690291
## 6    33.1      31.5      34.3      43317      3.28       3342       3909     12660      12558
##   MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC NO_FARMS07
## 1     12615     28989   191739       3.61    284367  29757    152828     101782       2117
## 2      2115      4818    31939       3.59     43867   2634     22329      18904       1129
## 3       926      2051    16255       2.94     35492   8944     17472       9076        845
## 4       413       710     6800       2.98     12710   2652      6590       3468        459
## 5    115984    296976  2194080       3.58   3445076 203872   1544749    1696455       1734
## 6      2056      4020    34093       3.63     49140   5823     27726      15591       1708
##   AVG_SIZE07 CROP_ACR07 AVG_SALE07    SQMI CountyFIPS                  NEIGHBORS PopNeigh
## 1       1116     942827    1513.53 8161.35      06103 San Bernardino,Tulare,Inyo  2495935
## 2        603     512870    1203.20 1391.39      06089         Fresno,Kern,Tulare  2212260
## 3        147      28997      72.31 1329.46      06106                       <NA>        0
## 4       1000      82567     120.92 4720.42      06086                       <NA>        0
## 5         63      49158     187.94 4087.19      06073        San Bernardino,Kern  2874841
## 6        398     290683     579.70 2153.29      06102                Mono,Fresno   944652
##   NEIGHBOR_1 PopNeigh_1 NEIGHBOR_2 PopNeigh_2                       geometry
## 1       <NA>         NA       <NA>         NA MULTIPOLYGON (((193446 -244...
## 2       <NA>         NA       <NA>         NA MULTIPOLYGON (((12524.03 -1...
## 3       <NA>         NA       <NA>         NA MULTIPOLYGON (((-240632.1 9...
## 4       <NA>         NA       <NA>         NA MULTIPOLYGON (((-45364.03 3...
## 5       <NA>         NA       <NA>         NA MULTIPOLYGON (((173874.5 -4...
## 6       <NA>         NA       <NA>         NA MULTIPOLYGON (((16681.21 -1...
# Look at all the variables included in our data
colnames(counties)
##  [1] "FID_"       "NAME"       "STATE_NAME" "POP2010"    "POP10_SQMI" "POP2012"    "POP12_SQMI"
##  [8] "WHITE"      "BLACK"      "AMERI_ES"   "ASIAN"      "HAWN_PI"    "HISPANIC"   "OTHER"     
## [15] "MULT_RACE"  "MALES"      "FEMALES"    "AGE_UNDER5" "AGE_5_9"    "AGE_10_14"  "AGE_15_19" 
## [22] "AGE_20_24"  "AGE_25_34"  "AGE_35_44"  "AGE_45_54"  "AGE_55_64"  "AGE_65_74"  "AGE_75_84" 
## [29] "AGE_85_UP"  "MED_AGE"    "MED_AGE_M"  "MED_AGE_F"  "HOUSEHOLDS" "AVE_HH_SZ"  "HSEHLD_1_M"
## [36] "HSEHLD_1_F" "MARHH_CHD"  "MARHH_NO_C" "MHH_CHILD"  "FHH_CHILD"  "FAMILIES"   "AVE_FAM_SZ"
## [43] "HSE_UNITS"  "VACANT"     "OWNER_OCC"  "RENTER_OCC" "NO_FARMS07" "AVG_SIZE07" "CROP_ACR07"
## [50] "AVG_SALE07" "SQMI"       "CountyFIPS" "NEIGHBORS"  "PopNeigh"   "NEIGHBOR_1" "PopNeigh_1"
## [57] "NEIGHBOR_2" "PopNeigh_2" "geometry"

It looks like we have a good amount of information about the total population for different years and the densities, as well as race, age, and occupancy info.

2.4 Plotting sf objects

We’re able to map our sf object because of the extra geometry column.

sf Geometries

There are three main types of geometries that can be associated with your sf object: points, lines and polygons:

In an sf data.frame these geometries are encoded in a format known as Well-Known Text (WKT). For example:

  • POINT (30 10)
  • LINESTRING (30 10, 10 30, 40 40)
  • POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))

where coordinates are separated by a space and coordinate pairs by a comma

Your sf object may also include the variants multipoints, multilines, and multipolgyons if some of the features are composed multiple parts. For example, if we had data representing US states (one per row), we could use a POLYGON geometry for states like Utah or Colorado, but would need a MULTIPOLYGON for states like Hawaii, which includes many islands.

Question What kind of geometry would a roads sf object have? What about one that includes landmarks in the San Francisco Bay Area?

Just like with other plots we can make in R, we can start customizing our maps’ colors, title, etc.

The most rudimentary way to do this would be to use base R’s plot function to plot our sf objects geometries.

# Plot our geometries, coloring them pale, with dark green borders
plot(counties$geometry, col='tan', border='darkgreen', main="CA counties")

However, we’ll get much more customizability if we use a special-purpose mapping package, rather than just relying on sf methods of base R functions.

Our go-to mapping package of choice will be tmap. Its name stands for “thematic maps”, i.e. maps in which we use dimensions of our dataset to control the visualization parameters of our maps, thus creating effective data visualizations.

You’ll get plenty of introduction here in the workshop, but for additional support you can check out the tmap vignette or Google other tutorials and references.

Let’s start by loading the package and creating a ‘quick tmap’.

# load tmap
library(tmap)

# plot a 'quick tmap'
qtm(counties)

Nice!

That’s the quickest, simplest example of a static map that tmap can make. However, tmap has 2 modes: - ‘plot’ mode: static maps - ‘view’ mode: interactive maps

tmap loads up in ‘plot’ mode. Let’s switch it to ‘view’ mode and then take a look at that same map.

We could either set the mode explicitly (tmap_mode('view')) or just toggle back and forth using the ttm (‘toggle tmap mode’) function. Let’s use the latter.

# toggle the mode
ttm()
## tmap mode set to interactive viewing
# then make our quick tmap again
qtm(counties)

That’s outstanding! We get a clickable, scrollable, zoomable map built in Javascript’s Leaflet package… right out of the box!

And to create thematic maps, we can use tmap’s more verbose mapping functions to create a new tmap object and then add geometry layers to it, setting different aesthetic aspects of those layers.

For now, let’s recreate that same map from above, but this time using tmap instead of the sf method of base R’s plot.

tm_shape(counties) +  # use the `tm_shape` function to create a tmap object
  tm_polygons(col='tan', border.col='darkgreen', # add `tm_polygons` layer, coloring as before, 
              alpha=0.5) # & making transparent

Nice! Looks pretty much the same as above, except now it’s interactive, and overlaid on a super-sweet basemap!

Now we have two mapping methods: - base R plot: nice for simple, pared down plotting tasks - tmap: quick maps, both static and interactive, with greater flexibility

2.5 Subset the sf object

Since we’ll be focusing on Berkeley later in the workshop, let’s subset our sf object to just be for Alameda County.

# See the vector of all county names included in our dataset
counties$'NAME'
##  [1] "Kern"            "Kings"           "Lake"            "Lassen"          "Los Angeles"    
##  [6] "Madera"          "Marin"           "Mariposa"        "Mendocino"       "Merced"         
## [11] "Modoc"           "Mono"            "Monterey"        "Napa"            "Nevada"         
## [16] "Orange"          "Placer"          "Plumas"          "Riverside"       "Sacramento"     
## [21] "San Benito"      "San Bernardino"  "San Diego"       "San Francisco"   "San Joaquin"    
## [26] "San Luis Obispo" "San Mateo"       "Santa Barbara"   "Santa Clara"     "Santa Cruz"     
## [31] "Shasta"          "Sierra"          "Siskiyou"        "Solano"          "Alameda"        
## [36] "Alpine"          "Sonoma"          "Amador"          "Stanislaus"      "Sutter"         
## [41] "Butte"           "Calaveras"       "Tehama"          "Colusa"          "Trinity"        
## [46] "Tulare"          "Contra Costa"    "Del Norte"       "Tuolumne"        "Ventura"        
## [51] "El Dorado"       "Yolo"            "Fresno"          "Glenn"           "Yuba"           
## [56] "Humboldt"        "Imperial"        "Inyo"

It looks like Alameda county is specified as “Alameda” in this dataset.

counties[counties$NAME == 'Alameda',]
## Simple feature collection with 1 feature and 58 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -205958 ymin: -60728.39 xmax: -129669.6 ymax: -9909.857
## projected CRS:  NAD83 / California Albers
##    FID_    NAME STATE_NAME POP2010 POP10_SQMI POP2012 POP12_SQMI  WHITE  BLACK AMERI_ES  ASIAN
## 35    0 Alameda California 1510271     2029.8 1534551   2062.402 649122 190451     9799 394560
##    HAWN_PI HISPANIC  OTHER MULT_RACE  MALES FEMALES AGE_UNDER5 AGE_5_9 AGE_10_14 AGE_15_19
## 35   12802   339889 162540     90997 740573  769698      97652   94546     91070    100394
##    AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_65_74 AGE_75_84 AGE_85_UP MED_AGE
## 35    107049    228204    227491    222617    173502     90437     52576     24733    36.6
##    MED_AGE_M MED_AGE_F HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C
## 35      35.6      37.5     545138       2.7      62350      79576    130014     123671
##    MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC NO_FARMS07
## 35     14625     41773   352423        3.3    582549  37411    291242     253896        525
##    AVG_SIZE07 CROP_ACR07 AVG_SALE07   SQMI CountyFIPS NEIGHBORS PopNeigh NEIGHBOR_1 PopNeigh_1
## 35        390      30549      95.92 744.06      06068      <NA>        0       <NA>         NA
##    NEIGHBOR_2 PopNeigh_2                       geometry
## 35       <NA>         NA MULTIPOLYGON (((-197580.8 -...

Now we can create a new sf object called alameda_county that is a subset of our counties geodataframe.

alameda_county = counties[counties$NAME == 'Alameda',]
# Plot our newly subsetted sf object
plot(alameda_county$geometry, col='pink', border='green', lwd=5, main='Why not?')

Nice! Looks like we have what we were looking for.

FYI: You can also make dynamic plots of one or more county without saving to a new gdf.

bay_area_counties = c('Alameda', 'Contra Costa', 'Marin', 'Napa', 'San Francisco', 
                      'San Mateo', 'Santa Clara', 'Santa Cruz', 'Solano', 'Sonoma')
qtm(counties[counties$NAME %in% bay_area_counties,])

2.6 Save your Data

Let’s not forget to save out our alameda_county object. This way we won’t need to repeat the processing steps and attribute join we did above.

We can save it as a shapefile.

st_write(alameda_county, "outdata/alameda_county.shp", delete_dsn=T)
## Deleting source `outdata/alameda_county.shp' using driver `ESRI Shapefile'
## Writing layer `alameda_county' to data source `outdata/alameda_county.shp' using driver `ESRI Shapefile'
## Writing 1 features with 58 fields and geometry type Multi Polygon.

One of the problems of saving to a shapefile is that our column names get truncated to 10 characters (a shapefile limitation.)

Instead of renaming all columns with obscure names that are less than 10 characters, we can save our sf object to a spatial data file format that does not have this limation - GeoJSON or GPKG (geopackage) file. - These formats have the added benefit of outputting only one file in contrast tothe multi-file shapefile format.

st_write(alameda_county, "outdata/alameda_county.json", driver="GeoJSON", delete_dsn=T)
## Deleting source `outdata/alameda_county.json' using driver `GeoJSON'
## Writing layer `alameda_county' to data source `outdata/alameda_county.json' using driver `GeoJSON'
## Writing 1 features with 58 fields and geometry type Multi Polygon.
st_write(alameda_county, "outdata/alameda_county.gpkg", driver="GPKG", delete_dsn=TRUE)
## Deleting source `outdata/alameda_county.gpkg' using driver `GPKG'
## Writing layer `alameda_county' to data source `outdata/alameda_county.gpkg' using driver `GPKG'
## Writing 1 features with 58 fields and geometry type Multi Polygon.

You can read these in, just as you would a shapefile, with st_read

gpkg_test = st_read("outdata/alameda_county.gpkg")
## Reading layer `alameda_county' from data source `/home/drew/Desktop/stuff/berk/dlab/RGeo_rewrite/outdata/alameda_county.gpkg' using driver `GPKG'
## Simple feature collection with 1 feature and 58 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -205958 ymin: -60728.39 xmax: -129669.6 ymax: -9909.857
## projected CRS:  NAD83 / California Albers
plot(gpkg_test)
## Warning: plotting the first 9 out of 58 attributes; use max.plot = 58 to plot all

json_test = st_read("outdata/alameda_county.json")
## Reading layer `alameda_county' from data source `/home/drew/Desktop/stuff/berk/dlab/RGeo_rewrite/outdata/alameda_county.json' using driver `GeoJSON'
## Simple feature collection with 1 feature and 58 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -205958 ymin: -60728.39 xmax: -129669.6 ymax: -9909.857
## projected CRS:  NAD83 / California Albers
plot(json_test)
## Warning: plotting the first 9 out of 58 attributes; use max.plot = 58 to plot all

There are also many other formats we could use for data output.

NOTE: If you’re working with point data (i.e. a single latitude and longitude value per feature), then CSV might be a good option!

2.7 Recap

In this lesson we learned about… - The sf package - Reading in shapefiles - st_read - sf data structures - dim, head, colnames, str - Plotting sf objects - plot - tmap - Subsetting sf objects - matrix subsetting syntax - Saving sf objects to file - st_write

Exercise: IO, Manipulation, and Mapping

Now you’ll get a chance to practice the operations we learned above.

In the following cell, compose code to:

  1. Read in the California places data (notebook_data/census/Places/cb_2018_06_place_500k.zip)
  2. Subset the data to Berkeley
  3. Plot using base R plot, and customize as desired
  4. Save out as a shapefile (outdata/berkeley_places.shp)

Note: pulling in a zipped shapefile has the same syntax as just pulling in a shapefile. The only difference is that insead of just putting in the filepath you’ll want to write zip://notebook_data/census/Places/cb_2018_06_place_500k.zip

To see the solution, double-click the Markdown cell below.

# YOUR CODE HERE

Double-click to see solution!


<div style="font-size:larger">&nbsp;D-Lab @ University of California - Berkeley</div>
<div>&nbsp;Team Geo<div>